A Second-Order CASSCF Algorithm with the Cholesky Decomposition of the Two-Electron Integrals

In this contribution, we present the implementation of a second-order CASSCF al-gorithm in conjunction with the Cholesky decomposition of the two-electron repulsion integrals. The algorithm, called Norm-Extended Optimization, guarantees convergence of the optimization, but it involves the full Hessian of the wavefunction and is therefore computationally expensive. Coupling the second-order procedure with the Cholesky decomposition leads to a signiﬁcant reduction in the computational cost, reduced memory requirements, and an improved parallel performance. As a result, CASSCF calculations of larger molecular systems become possible as a routine task. The performance of the new implementation is illustrated by means of benchmark calculations on molecules of increasing size, with up to about 3000 basis functions and 14 active orbitals.


Introduction
The complete active space-self-consistent field (CASSCF) method [1][2][3] is a powerful tool to achieve a qualitatively correct description of strongly correlated systems.Thanks to its intrinsic multireference nature, it can be used to compute the structure and molecular properties of a large manifold of interesting systems that are poorly described with standard single-reference methods.These include many open-shell systems, molecules with stretched bonds, and therefore reactivity, excited states and others.It can also provide a starting point for subsequent high-level correlated treatments, such as internally-contracted multireference configuration interaction 4,5 (CI) and coupled cluster; [6][7][8][9][10][11][12][13] multireference perturbation theory such as CASPT2, 14,15 and NEVPT2; [16][17][18] or even quantum Monte Carlo methods. 19,20Unfortunately, the method suffers of three major complications that restrict its applicability.First, it is not a black box method, as it requires the user to select the active space for the calculation.While there are a few strategies to aid the selection, [21][22][23] achieving good results relies still on the user's chemical intuition and understanding of the system.Second, the CASSCF wavefunction's optimization problem is notoriously hard to converge.Third, the method is computationally very demanding.
The computational cost of a CASSCF calculation stems from two concurring factors.
The most prominent one is that the method requires to solve a full CI (FCI) problem in the active space.Due to the combinatorial scaling of FCI, the investigation of large active spaces is not possible using standard direct CI techniques.Approximations to the FCI wavefunction can be used to overcome this otherwise overwhelming barrier, the most common example being the use of a density-matrix renormalization group 24 (DMRG).However, many interesting systems can be successfully described with a relatively small active space (up to 12-14 electrons in as many orbitals).If a careful choice of the active space that allows to capture the static correlation of the wavefunction with a limited number of active orbitals is possible, the cost of the CI part is either negligible (for active spaces with less than 10 orbitals) or manageable with traditional implementations.In such cases, the cost of the calculation is dominated by the operations involving the manipulation of the electron repulsion integrals (ERIs).
Convergence problems can be mitigated, if not completely solved, by using an optimization algorithm with guaranteed convergence to the closest local minimum.Methods based on a restricted step second-order optimization offer such a guarantee and are, therefore, a very attractive option.However, as they involve the evaluation of the energy Hessian with respect to the variational parameters, i.e., orbital rotations and CI expansion coefficients, they are in general more expensive than their first-order counterparts and require cumbersome and involved implementations.Nevertheless, second-order CASSCF implementations have been successfully achieved and are based on two main algorithm.The first algorithm, originally proposed by Werner and Meyer 25 and further refined by Werner, Knowles and others, 26,27 is based on the definition of a model energy function which is infinite order in orbital rotations and that is optimized.The coupling between CI and orbital optimization is introduced up to the second-order, ensuring thus quadratic convergence.This algorithm shows excellent convergence properties and overall performances.A similar strategy has been followed by Sun et al. 28 and the resulting algorithm, which is based on an integral-direct implementation and can use DMRG as a CASSCF solver, exhibits impressive performances.A second choice is to use a more traditional trust-region second-order method, such as the Levenberg-Marquardt method. 29Augmented with an adaptive choice of the trust radius, as proposed by Fletcher (we refer to the global strategy as FLM), it is possible to prove that the FLM method always converges to the closest local minimum and that the rate of convergence is quadratic.A very efficient implementation of the FLM method, known as the Norm-Extended Optimization (NEO) algorithm, has been proposed by Jensen and coworkers. 30,31In this contribution, we follow the latter strategy, which we have previously implemented in the CFOUR 32,33 suite of programs.
A second-order CASSCF implementation requires to work with ERIs transformed in the molecular orbital (MO) basis with at least two indices spanning the full rank of MOs.The transformation of the ERIs from the atomic orbitals (AO) to the MO basis is expensive, requiring O(M N 4 ) floating point operations, where M is the number of internal and active orbitals and N the number of basis functions.Furthermore, it is not easily implemented in an efficient way.This is due to the fact that the ERIs matrix is usually too large to fit in memory, especially in the MO basis, which implies that the transformation involves slow disk I/O.Furthermore, the AO ERIs are computed (and stored) in an order that depends on the shell structure of the basis set for the specific system.As a consequence, the integrals are read (or recomputed, in integral direct implementations) in a system-dependent order, which makes the use of efficient BLAS routines 34,35 and, more in general, vectorization, particularly challenging.
Furthermore, it offers a compact representation of the ERIs that is well suited for vector, efficient implementations, as the Cholesky-decomposed ERIs can be often kept in memory with standard computer hardware and are easily manipulated using highly optimized level 3 BLAS routines.Furthermore, all the ERIs manipulation can be written as the sum of independent operations on a given Cholesky vector and are therefore very easy to parallelize.
In this contribution, we present an implementation of NEO CASSCF in the CFOUR suite of programs 32,33 based on the CD of the ERIs.The implementation is tested on several molecular systems of increasing size, for active spaces that go from small (CAS (6,6)) to large (CAS (14,14)) and using up to about 3000 basis functions.The paper is organized as follows.
In Section 2, the derivation of the NEO CASSCF method is reviewed.The implementation of the algorithm is discussed in Section 3 with a special focus on the Cholesky implementation.
In Section 4, benchmark calculations are presented for the purpose of showing the perfor-mance of the algorithm in the optimization of medium-to-large systems.Finally, concluding remarks and some perspectives on future developments are given is Section 5.

Norm Extended Optimization CASSCF
In this section, we recapitulate the main aspects of NEO CASSCF.First, the parametrization of the wavefunction is discussed in Section 2.1.Then, the NEO algorithm is briefly summarized in Section 2.2.Further details regarding the optimization algorithm can be found in Ref. 30 or in a previous paper by two of us. 63

Parametrization of the CASSCF wavefunction
The starting point for the following discussion is given by a set of molecular orbitals (MOs) , where N b is the number of basis functions.In CASSCF, the MOs are subdivided into three classes according to their allowed occupation number in a Slater determinant-namely internal, which are always doubly occupied; active, which are subjected to no restriction; and external, which are always empty.To distinguish an orbital among such classes, the following labels are used: i, j, k refer to inactive, u, v, x to active, a, b, c to external, and p, q, r to generic orbitals.Indices that run over the determinantal space are labelled with capital letters I, J.
A convenient parametrization for the wavefunction, first proposed by Jensen and Jørgensen, 30 is Here, is the current approximation to the wavefunction, or current expansion point (CEP).|c is the correction vector that collects the CI variational parameters and P = 1 − |0 0| is the operator that projects |c in the orthogonal complement of |0 thus keeping any redundant vector parallel to the CEP.
Orbitals variations are described through a unitary transformation, φ = ϕU , that is conveniently parametrized by using an exponential map where Êpq = σ â † pσ âqσ is the spin-traced singlet excitation operator.The variational parameters are given by the elements of the anti-symmetric matrix, κ.Since only rotations between different orbitals classes produce a variation in the energy, the expression for κ can be simplified as follow κ = Hence, κ is considered as a vector whose dimension is given by all non-redundant orbitals rotations, i.e.N rot = N int N act + N int N ext + N act N ext .

Optimization of the CASSCF wavefunction
Equation 1 is used to define a variational expression for the electronic energy that reads In equation 5, Ĥ is the non-relativistic Hamiltonian operator written in second quantization where êpqrs = Êpq Êrs − δ rq Êps , h pq are one-electron integrals, (pq|rs) are two-electron integrals written in Mulliken's notation, and E nuc is the nuclear repulsion term.A second-order algorithm can be developed by defining a quadratic model for the energy; therefore, we expand equation 5 in power series up to second order.To this end, it is useful to define a generic parameter point, x = (c, κ), and the reference one, x 0 = (c (0) , 0) such that In equation 8, E 0 is the reference energy, that is 0| Ĥ |0 , while g and G are respectively the electronic gradient and Hessian evaluated at the CEP.Analytical expression for such quantities can be obtained by direct differentiation of equation 8 and by exploiting the Baker-Campbell-Hausdorff (BCH) formula.The gradient is given as and the Hessian The minimization of the quadratic model directly leads to the Newton-Raphson (NR) equations.However, the radius of convergence of NR is small, and the Hessian can be non positive-definite at the beginning of the optimization leading to incorrect search directions.
To overcome this issue, a more robust strategy consists in using a trust-region optimization algorithm, e.g. the Levenberg-Marquardt (LM) method, 29 where the minimization is performed in a reduced domain such that the Hessian has the correct signature.The LM equations can be seen as diagonally shifted NR ones, where the shifting parameter controls the step length to be within a predefined trust radius R t .The Norm-Extended Optimization (NEO) algorithm 30,64 is an elegant way to recast the LM minimization problem into an eigenvalue-eigenvector one where L(α) is the gradient-scaled augmented Hessian matrix It can be shown that for ground-state optimization the optimal direction is given by the first eigenvector of L(α).Once y is given, the NEO step can be computed as Here, P is the matrix representation of the projector operator P .The step length is controlled by the parameter α and can be obtained by solving the equation Eventually, the trust radius is changed adaptively during the optimization procedure according to Fletcher's algorithm. 29If the energy increases, the step is discarded, and the trust radius is decreased.Otherwise, R t is either increased or left untouched based on the value of the ratio between the predicted variation of the energy and the actual one.The combined strategy-NEO plus Fletcher's update-leads to an algorithm that always converge to the closest local minimum for well behaved wavefunctions.

Implementation
In this section, the implementation of the NEO algorithm within the CFOUR 32,33 suite of programs is discussed.In Section 3.1, we present working expressions for the gradient and for the linear transformations that describe the action of the augmented Hessian matrix on a trial vector.In Section 3.2, the Cholesky Decomposition of the two-electron integrals is introduced.Details regarding a cost-effective implementation that exploits the Cholesky vectors are given for a specific example.

Direct NEO equations
The NEO algorithm can be thought as a two-level procedure.In the first level-the macroiterations-the parameter hyper-surface is scanned by updating the CEP and the MOs.In the second level-the micro-iterations-a specific NEO eigenvalue-eigenvector problem is iteratively solved with the intention of getting the optimal step direction.At each macroiteration an atomic orbitals (AO) to MO transformation is performed.Then, the orbital and CI gradient are assembled, and the electronic energy is calculated.The CASSCF energy can be written as where γ uv and Γ uvxy are the one-and two-body reduced density matrices respectively, which can be computed as the expectation value of the excitation operators F I pq are the elements of the inactive Fock matrix and E i is the energy contribution that stems from the inactive electrons and is called inactive energy Manipulation of equation 10 leads to an anti-symmetric expression for the orbital gradient In equation 20 we have introduced the generalized Fock matrix, whose elements can be written in terms of the inactive Fock matrix, active Fock matrix, and Q matrix.The last two are defined as follows: As equation 9 states, the CI gradient can be evaluated as the action of the Hamiltonian operator on |0 where the last term is a vector parallel to the CEP that stems from the presence of the projector operator in the wavefunction definition.
The iterative solution of the NEO eigenvalue-eigenvector problem (micro-iterations) requires setting up expressions for the matrix-vector product between the augmented Hessian and a trial vector The present implementation makes use of the split-Davidson algorithm, 64 where configurations-only, v conf = (v c , 0), and orbitals-only, v orb = (0, v o ), vectors are added in the Krylov-like subspace.This procedure allows to adaptively add to the subspace either v conf or v orb depending on the part that exhibits the largest residual.Here we report the expressions for the direct product where |v c = I v I |Φ I , and H = κ, Ĥ is the one-index transformed Hamiltonian operator.
In equation 26, we have introduced the transition gradient g T pq ; that is, a gradient computed with symmetrized transition density matrices.The first term of equation 28 is a gradientlike contribution computed with one-index transformed one-and two-electron integrals.It can be effectively computed by means of the transformed inactive Fock matrix, active Fock matrix, and Q matrix whose expressions are given below Explicit expressions for equation 28 are given in the Supporting Information.
The transformed matrices have to be computed at each step of the micro-iterations and together with the AO to MO transformation constitute the bottleneck of the algorithm when the chosen active space is small.A summary of the NEO algorithm is given in Figure 1.

NEO equations with Cholesky vectors
The ERI matrix is symmetric and positive semidefinite; therefore, it can be decomposed according to the Cholesky Decomposition (CD) (pq|rs) We compute the CD of the integrals using the partial pivoting algorithm proposed by Koch et al., 47 which has been implemented inside the Mainz integral package 65 (MINT) in CFOUR. 32,33The procedure stops whenever the residual of the diagonal is below a user defined threshold.Using the Cauchy-Schwarz inequality, it can be shown that the error on the reconstructed integrals is always lower equal than the threshold, so it can be controlled systematically.In equation 32, N ch is the number of Cholesky vectors generated; the higher the decomposition threshold the lower the number of Cholesky vectors.
The Cholesky representation of the integrals has been substituted in all equations, namely the Fock matrices, the transformed Fock matrices, and the active ERI matrix.In order to illustrate the implementation of the evaluation of the aforementioned quantities, we discuss in detail the calculation of the transformed Q-matrix.Implemented expressions for the transformed Fock matrices can be found in the Supporting Information.Inserting equation 32 into equation 31 we get: The first term of equation 33 can be straightforwardly computed from the Q matrix.The second term is evaluated by first assembling, for each Cholesky vector, the intermediate quantities and such that r vxy Regarding the last term, we notice that r v vr L K ry is the fully active part of the intermediate S K of equation 35.Hence, we define and where in equation 38 we exploited the symmetry of the two-body reduced density matrix.
Gathering together equation 36, 37, and 38, we can rewrite equation 31 as where A remarkable fact about the CD is that all the operations involving different Cholesky vectors can be performed independently.As a consequence, parallelization is easily achieved by distributing the Cholesky vectors among the available processors, with a final reduction to be performed on the computed quantities.For this reason, we implemented the Cholesky loop as the most external in order to parallelize it with shared-memory (OpenMP) directives. 66

Benchmarks
In this section we present benchmark calculations to illustrate the performance of the CD-CASSCF implementation.In all the calculation, convergence is achieved when the root mean square norm of both the orbital and CI gradient is below 10 −7 .The threshold for the Cholesky decomposition has been set to 10 −4 .The starting MOs are obtained from Restricted-Hartree-Fock (RHF) calculations.Point-group spatial symmetry was not used.This section is organized as follows.In Section 4.1, we compare the CD and standard implementations on a small set of medium-sized molecules.In Section 4.2, we present benchmark results on a set of medium-sized aromatic molecules, taken from ref. 27, discussing in detail the cost associated with the various operations involved in the NEO CASSCF calculation.Finally, in Section 4.3, we perform benchmark calculations on larger molecular systems, both in terms of number of basis functions and size of the active space.

Cholesky versus standard implementation
As a first analysis, we compare the performance of the Cholesky algorithm with respect to the standard one both in terms of computational cost and memory requirements.We selected 7 medium-sized molecular systems, catechol, naphthalene, pyridoxamine, 5,7-dimethylidene-2H,3H,5H,7H-thieno [3,4-b][1,4]dioxine (herein referred to as 2Me4HSdiox), indole, tryphtophan, and nicotine.The geometries were taken from ref. 67.The basis set used is Dunning's cc-pVTZ. 68The standard CASSCF implementation handles the one-and two-electron integrals by reading them from disk at each macroiterations.As disk I/O is poorly parallelizable on standard hardware, it is a serial implementation, contrary to its CD-based counterpart.
For consistency, we compare it to CD calculations run sequentially on a single CPU core.
Both traditionals and CD CASSCF calculations were run on a single core of an Intel Xeon Gold 5120 CPU running at 2.20 GHz, on a cluster node equipped with 128GB of RAM.The standard implementation uses a semi-direct algorithm, where the MO Hessian is explicitly calculated.Given the size of the systems involved in this comparison, this is the most efficient algorithm.On the contrary, the CD implementation exploits a fully direct procedure as described in the previous section.For all calculations with both the standard and CD codes, the relevant MO-transformed ERIs and Cholesky vectors, respectively, were kept in memory.
The exact CD of the ERI matrix would generate N b (N b + 1)/2 Cholesky vectors.We define a compression rate to measure the effectiveness of the truncated CD in reducing the dimension of the computational problem.We report in Table 1, for each of the selected molecules, the active space (CAS), the number of basis functions (N b ), the disk space needed to store the two-electron integrals, the compression rate, the average time (averaged over the number of macroiterations) spent to perform the AO to MO integral transformation (AO to MO), the number of macroiterations, and the total CPU wall time (Time) in minutes.As expected, the storage Table 1: Systems used for the comparison of the standard and CD implementations of CASSCF in CFOUR.For each molecule, we report the size of the active space (CAS), the number of basis functions (N b ), the compression factor f defined in eq.40, the size in gigabytes of the Cholesky vectors (CD) and two electron integrals (STD), the average time in minutes required for the tranformation of the ERIs from the AO to the MO basis (AO to MO) for the CD and standard (STD) implementations, the total number of macroiterations, and the overall elapsed (wall) time in minutes for the calculation using the CD or standard (STD) code, respectively.
Size requirements for the CD vectors are significantly lower than for the standard two-electron integrals.It is worth remarking that even for the largest system of this set, the Cholesky vectors can easily be kept in memory even on a standard desktop computer.This is one of the main advantages of using a reduced order approximation of the ERIs, as it allows to perform full in-core calculations, avoiding thus slow disk I/O operations.The traditional calculation is dominated in cost by the transformation of the integrals into the MO basis, which has to be repeated at each macroiteration.In the standard code, the AO to MO transformation is performed by reading a batch of integrals from disk and then transforming the first index, which is restricted to internal and active orbitals.The other three indices are then transformed using level 3 BLAS matrix-matrix multiplications (DGEMM), making the first step the cost-dominating one.The total cost of this operation is O(M N 4 b ), where M is the number of internal and active orbitals.
In the CD code, on the contrary, the Cholesky vectors are transformed in core by performing two DGEMM matrix-matrix multiplications per vector.No disk I/O is required and the overall transformation can be carried out in an efficient, highly vectorized fashion.The total cost of the CD AO to MO transformation is O(N ch N 3 b ).It is apparent from Table 1 how this operation is no longer a bottleneck (see the next section for further discussions), thanks to both the compression stemming from the CD and to the handy data structure of the Cholesky vectors, that allows the procedure to be performed very efficiently.The reduction in the AO to MO timing is reflected also on the total execution time, which is one to two orders of magnitude lower in the CD code.
The chosen decomposition threshold (10 −4 ) allows us to obtain high compression rates while retaining an overall good accuracy.In Table 2, we report the converged CASSCF energy obtained with both the Cholesky and standard implementations for the previously selected molecules.As it can be seen from the table, the two results are in agreement to at least the fourth decimal digit, with the largest deviation being about 50µE h .We note that it has been documented in the literature that CD benefits from error cancellation, thus further increasing the accuracy of energy differences. 60,69

Benchmark calculations for medium-sized systems
The first benchmark set is composed of 21 aromatic molecules.The geometries were taken from Ref. 67; the set was used also by Kreplin et al. to test their MCSCF solver. 27,70As before, the basis set used is Dunning's cc-pVTZ, 68 for a total of 250 basis functions for the smallest molecule and 618 for the largest one.For each system, all the orbitals, including the core ones, are fully variationally optimized.All the calculations presented here were performed on a single cluster node equipped with two Xeon Gold 5120 CPUs, for a total of 28 cores, running at 2.20GHz.Shared memory parallelization is exploited in all the calculations.
We point out here that we do not expect the implementation to be fully scalable, the limiting factor being the Full CI code.This is due to the fact that the sequential code is highly cacheoptimized, which causes an overload of the cache, and consequent loss of efficiency, when more cores of the same processor share cache access.Nevertheless, even a simple-minded OpenMP parallelization of the main loops is beneficial.In Table 3 we reported for each molecule, the active space (CAS), the number of basis functions, the number of macro-iterations required to converge, and the total CPU wall time in minutes.As a first consideration, we note that the starting RHF orbitals used in this benchmark are a poor choice for CASSCF, as they cause the MO Hessian to be poorly conditioned.Using RHF orbitals is therefore a good way to test the robustness of the optimization algorithm, but not an optimal one for application sake.In practice, the ill-conditioning of the Hessian reflects both into slow convergence of the microiteration, which are Davidson iterations used to compute the lowest eigenvector of the NEO augmented Hessian, and in a larger number of macroiterations.Despite such difficulties, all calculations converged in at most 11 iterations and took less than 20 minutes, which demonstrates the robustness of the NEO algorithm and the overall efficiency of the implementation.
To further investigate the performance of the algorithm, we can subdivide the work into three main tasks-the AO to MO transformation, the optimization of the MOs (MOs opt.), and the optimization of the CI coefficients (CI opt.).The MOs optimization includes the calculation of the orbital gradient (eq.20), which in turn requires to assemble the various Fock matrices (eqs.18 and 21-22), the calculation of the diagonal of the MO Hessian (which is used as the preconditioner in the Davidson diagonalization), and the evaluation of the direct equations 27 and 28 for each micro-iterations.On the other hand, the CI optimization and evaluating equations 26 and 25 at each micro-iterations.Table 4 shows the percentage time to perform these three operations with respect to the total time of a specific macroiteration.Also, the total number of iterations required to solve the NEO problem (micro-It.) is reported.
The CD extremely facilitate the integrals transformation shifting the bottleneck to the MOs optimization part.For the systems considered, in particular, most of the time is spent in computing the transformed Fock matrices, an operation that is required to assemble the NEO Hessian-orbital trial vector product (eqs.27, 28).We also note that for larger active spaces, such as in biphenyl, the cost associated with the CI part starts to become non

Benchmark calculations for large systems
In order to test the new implementation on more challenging problems, we augment the benchmark set discussed in sec.4.2 with 10 larger molecules.The calculations involved active spaces up to CAS (14,14) and as many as 2962 basis functions.The geometries of the molecules were optimized at the B3LYP/6-31G(d) 71,72 level of theory using the Gaussian 16 suite of programs. 73All the structures can be found in the Supporting Information, a pictorial representation of the molecules is given in Figures 2, and 3.For all CD-CASSCF calculations, we used Dunning's cc-pVTZ basis set.The calculations were performed on the same cluster node used for the previous set, with the exception of the largest system (chlorophyll), for which we used a cluster node equipped with 1.2TB of memory and 4 Intel Xeon Gold 6140M CPUs running at 2.30GHz, for a total of 72 cores.
The total execution times for the first seven systems are comparable with the ones obtained for the previous set and show the overall good performance of the code, both in terms of total time and of convergence.The calculations on anthracene and resveratrol, for which  6 the percentage of the time spent performing the same operations discussed in Section 4.2 for the two molecules for the last macroiteration.Here, the most expensive operations are the direct-CI steps needed to compute the CI gradient and the CI part of the NEO augmented Hessian-configuration trial vector products, together with the assembling of the reduced density matrices.As it can be seen, these operations take about 80% of the total time.
Finally, the largest system tested, chlorophyll, can be thought as a pilot example of a large-scale application on a biologically relevant molecule.For this specific case, the storage of the Cholesky vectors in memory required more than 600GB of RAM, which is the reason why the calculation was performed on a different computer.The calculation converged in 15 that the price to pay for a second-order optimization lies in the cost of solving the NEO equations, which are by far dominating the overall cost of the calculation.Nevertheless, the guarantee of convergence remains an attractive feature of the method and the overall time required for the calculation is not excessive.

Conclusions
We have presented the implementation of a second-order CASSCF optimization algorithm that exploits the Cholesky Decomposition of the two electron integrals.The algorithm is based on a trust-region method, which requires to solve diagonally shifted Newton-Raphson equations known as Levenberg-Marquard (LM) equations.Also, it adaptively modifies the trust radius during the optimization according to the value of the energy with the result that the overall algorithm always converges to the closest minimum for regular enough functions.
The coupling between orbitals and CI coefficients is naturally included in the off-diagonal blocks of the Hessian matrix making this algorithm naturally second-order in all parameters.
The implementation is based on the Norm-Extended Optimization (NEO) formalism where the LM equations are recast into an eigenvalue problem, where the first eigenvector provides the optimal direction for ground-state minimization problems.
To reduce the computational cost associated with orbitals optimization, which is dominating for not-too-large active spaces, we implemented the NEO algorithm using the Cholesky Decomposition (CD) of the two-electron integrals matrix.The NEO equations were rewritten in terms of the Cholesky vectors, taking particular care in recasting all the equations in a way that allowed us to implement them efficiently thanks to an extensive use of level 3 BLAS routines.The implementation exploits a fully direct algorithm where the Hessian matrix is never explicitly calculated.Furthermore, since the Cholesky vectors are independent among the others, the code has been parallelized with shared-memory OpenMP directives.
The resulting algorithm was tested on various aromatic systems.We used a triple zeta basis set with up to 2962 functions and active spaces up to CAS (14,14).Despite the choice of a very poor guess for the orbitals, namely, Restricted-Hartree-Fock canonical orbitals, all the calculations converged swiftly and required limited computer time.Thanks to the effective compression of the two-electron integrals matrix operated by the CD, fully incore calculations are possible for most systems, eliminating thus the bottleneck of slow disk I/O.While several further improvements and optimizations are possible, for instance, to improve the convergence of the microiterations, the benchmark calculations reported in this contribution show that a rigorous second-order algorithm can be used in large-scale applications at a reasonable computational cost.Future work will focus on both algorithmic improvements and extensions of the methodology.In particular, a first order procedure such as super CI 74,75 could be used in the preliminary phase of a calculation to achieve an initial intermediate convergence goal, thus providing a very good starting point for the quadratically convergent optimization.We also plan to extend the second order procedure to the simultaneous optimization of several electronic states and to the calculation of analytical gradients, by implementing differentiated Cholesky vectors. 61,76

Figure 1 :
Figure 1: Macro/micro iterations scheme for the NEO algorithm The evaluation of the transformed matrix requires thus O(N ch (N 4 act + N 2 b N act + N b N 2 act )) floating point operations, that can all be performed using optimized level 2 and 3 BLAS routines and O(N b ) 2 words of memory for scratch.

Figure 2 :
Figure 2: New set of aromatic molecules used to test the CD-CASSCF algorithm together with their active spaces.

Figure 3 :
Figure 3: Structure of the chlorophyll molecule.

Table 2 :
Comparison between the converged CASSCF energy of the Cholesky and standard implementation.Energy values are in given in Hartree.

Table 4 :
Percentage time of the three leading operations with respect to the total time of a specific macroiteration (Time).AO to MO refers to the atomic orbitals to molecular orbitals transformation of the Cholesky vectors, MOs opt. is the time spent in the MOs optimization and includes operations such as: calculation of the orbital gradient, evaluation of the NEO augmented Hessian-orbital trial vector products.CI opt.refers to the CI optimization and include the following operations: calculation of the CI gradient, calculation of the reduced density matrices, and evaluation of the NEO augmented Hessian-configuration trial vector products.micro-It is the number of microiterations required to solve the NEO eigenvalueeigenvector problem.

Table 5 :
Results for the new set of aromatic molecules.For each molecule we reported the active space (CAS), the number of basis functions (N b ), the number of macroiterations (It.), and the total elapsed CPU (wall) time in minutes.

Table 6 :
Percentage time of the three main operations (AO to MO, MOs opt., CI opt.) with respect to the total time of the last macro-iteration.
macroiterations and took slightly more than 15 hours.It is interesting to look in more detail the cost associated to the various operations in a given macroiteration.Focusing on the ninth macroiteration as an example, which required 12 microiterations to converge, the AO to MO transformation of the Cholesky vectors took 11.7 minutes, and the MOs optimization lasted 54.9 minutes.The time required by the CI operations is negligible.Here, we see clearly